Objective - Examine Validity of Some “NBA Myths”:

Many factors affect an NBA game. To understand the impact of each factor, we use three different data sets. This lets us analyze each game at the player level as well as the team level and check the validity of certain “NBA Myths”. The following questions guide our analysis.

Questions:

  1. Shooting
    1. Are players more likely to make shots from a particular location, e.g. the corner three?
    2. Hot Hand Analysis - the belief that a player can get “hot” during a game and start making every shot.
    3. Warriors are known as “elite” shooters. How much farther do they actually shoot from?
  2. Home Court Advantage - the belief that teams perform better at home than in away games.
    1. Is there a location that’s more or less challenging for teams to play at? E.g. some believe that the high elevation in Denver gives the Nuggets an advantage.
  3. Regular season does not matter - the belief that elite players do not play at full effort until the playoffs.
    1. What’s the Warriors’ likelihood of winning when Steph (their top player) is “locked in”?

Data Sets and Sources Information:

Conclusions from an analysis are only as good as the data available. This project assumes that the data sets I am using accurately reflect the true historical NBA data.

  1. shots : extracted with Python from sports_radar (unknown reliability)
    • Got the NBA schedule for the Warriors’ last four games: ‘2021-11-30’, ‘2021-12-03’, ‘2021-12-04’, ‘2021-12-06’
    • Provided each game_id to the playperplay database.
    • Contains information about every shot attempted in all the games played on those 4 dates.
  2. allnbagames : extracted with Python from the rapidapi website (unknown reliability)
  3. stephcurry_df : from https://data.world/datatouille/stephen-curry-stats (unknown reliability)

Hot Hand Analysis

Data Ingestion and Wrangling

In the data preparation step, we added columns to the shots dataset to make it easier to analyze. We also created a subset of the shots data to analyze the Warriors more closely.

All X and Y Coordinates of Shots Made

An NBA court is 94 x 50 feet. The 3-point line is 23 ft 9 in from the basket at the top of the arc. The plot shows the location of every made shot in the period covered by our data set.

Spatial Correlation of Top Players

Is there any correlation between the x and y coordinates of their successful shots?

  1. “Giannis Antetokounmpo”
  2. “Kevin Durant”
  3. “Chris Paul”
  4. “Jimmy Butler”
  5. “Trae Young”
  6. “James Harden”
  7. “LeBron James”
  8. “Stephen Curry”

This shows where on the arc Steph Curry makes his shots.

  1. Does this imply that Stephen Curry can only make shots at the top of the arc?
  2. Jimmy Butler makes a lot of his shots by the rim, so his x and y coordinates are highly correlated.

Central Limit Theorem - Mean Shooting Distance.

## The Mean Shooting Distance for the League is 14.4783 feet
##  sample size = 10  Mean =  14.4353 SD =  3.1277 
##  sample size = 20  Mean =  14.564 SD =  2.2072 
##  sample size = 30  Mean =  14.5105 SD =  1.8753 
##  sample size = 40  Mean =  14.4919 SD =  1.5672

The Central Limit Theorem states that as the sample size increases, the distribution of the sample mean approaches a normal distribution centred on the population mean, with a standard deviation that shrinks in proportion to 1/sqrt(n). As a result, we are more confident that the mean of a larger sample reflects the true population mean. Graphically, we see this: the spread of the histogram decreases as the sample size increases, matching the SD falling from 3.13 at n = 10 to 1.57 at n = 40. The true mean shooting distance is 14.48 feet, and the sample means stay within about 0.1 feet of it at every sample size.
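
The shrinking spread can be reproduced with a short simulation. This is a minimal sketch, not the code behind the output above: the population is a made-up right-skewed sample from `rgamma` standing in for the league's shot distances, and `clt_summary` and `n_draws` are hypothetical names introduced for the illustration.

```r
set.seed(42)

# Hypothetical population standing in for league shot distances (feet).
# The real shots data is not reproduced here.
population <- rgamma(10000, shape = 2, scale = 7)

# Draw n_draws samples of a given size and summarise the sample means.
clt_summary <- function(pop, sample_size, n_draws = 1000) {
  means <- replicate(n_draws, mean(sample(pop, sample_size)))
  c(sample_size = sample_size, mean = mean(means), sd = sd(means))
}

# One row per sample size: mean and SD of the simulated sample means
results <- t(sapply(c(10, 20, 30, 40), clt_summary, pop = population))
print(round(results, 4))

# Theory predicts an SD of roughly sd(population) / sqrt(sample_size)
print(round(sd(population) / sqrt(c(10, 20, 30, 40)), 4))
```

The two printed blocks should roughly agree, which is the 1/sqrt(n) behaviour visible in the histograms.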

Examining Warriors Performance

  1. How much farther can the Warriors shoot compared to the league?

Conclusion:

  1. Steph Curry generates the most points, followed by Andrew Wiggins and Jordan Poole. He has a lower average of points made per attempt than other players, which may be due to several reasons:
    • having more challenging looks (guarded shots)
    • providing good assists to his teammates
    • shooting from farther away

Sampling

There are many sampling methods, and sample results are used to estimate population characteristics. Employing several sampling methods, the following analysis aims to estimate the true mean shooting distance of the Warriors. Comparing these numbers with the league’s average shooting distance will help us quantify how strong a shooting team the Warriors are.

## The Mean Distance for the GS Warriors is  15.39  feet  
##  Mean from Stratified Sampling of the CSG is  14.34 
##  Mean from Sample Inclusion is  25.49 
##  Mean from Systematic Sampling is  15.15 
##  Mean from SRSWOR is  13.94

Conclusion:

  1. Systematic Sampling comes closest to the true mean of 15.39 feet (15.15), followed by Stratified Sampling (14.34), then SRSWOR (13.94), and finally the UPsystematic method (25.49). This could be because the data is ordered by game. Players perform differently in each game, taking more or fewer shots and shooting better or worse, so Systematic Sampling may have drawn a similar number of shots from each game.
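
The three equal-probability methods compared above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code that produced the output: the `shots` data frame, its `game` and `distance` columns, and the sample size `n` are all made up. The unequal-probability method is left out because it depends on the `sampling` package.

```r
set.seed(1)

# Hypothetical stand-in for the Warriors' shots: 4 games, distance in feet.
shots <- data.frame(
  game = rep(paste0("game_", 1:4), times = c(80, 90, 85, 95)),
  distance = round(runif(350, min = 0, max = 30), 1)
)
n <- 40  # sample size, chosen arbitrarily for the illustration

# Simple random sampling without replacement (SRSWOR)
srswor <- shots[sample(nrow(shots), n), ]

# Systematic sampling: random start, then every k-th row
k <- floor(nrow(shots) / n)
start <- sample(k, 1)
systematic <- shots[seq(start, by = k, length.out = n), ]

# Stratified sampling: the same number of shots drawn at random from each game
per_game <- n / length(unique(shots$game))
stratified <- do.call(rbind, lapply(split(shots, shots$game), function(g) {
  g[sample(nrow(g), per_game), ]
}))

cat("Population mean:", round(mean(shots$distance), 2), "\n")
cat("SRSWOR mean:    ", round(mean(srswor$distance), 2), "\n")
cat("Systematic mean:", round(mean(systematic$distance), 2), "\n")
cat("Stratified mean:", round(mean(stratified$distance), 2), "\n")
```

Because the rows are ordered by game, the systematic sample spreads its draws across all games, which is one reason it can track the population mean closely.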

Hot Hand - Wait Time until next made shot after shot has been made - Exponential Distribution

Is the wait time normally distributed? If a player has a hot hand, we would instead expect a right-skewed distribution with a peak close to 0, meaning that made shots tend to follow each other closely.

## Mean Wait Time for Curry = 1.78 Number of Shots
## Mean Wait Time for Durant = 1.2 Number of Shots
## Mean Wait Time for Jokic = 0.29 Number of Shots
## 
##  Kevin Durant Stephen Curry 
##          2.00          5.25

Conclusions: Durant and Jokic have a shorter wait time until their next made shot compared to Curry. However, Curry takes more three-pointers, so more of his made shots are worth 3 points, while a larger share of Durant’s and Jokic’s made shots are worth only 2 points.

It looks like Jokic has a short wait time until his next successful shot given that he has just made one. However, this was based on only 1 game for Jokic, which explains the white spaces between the blue bars. So, the hot hand myth remains inconclusive given the lack of data.

Stephen Curry averages 5.25 threes per game while Kevin Durant only averages 2 threes per game.
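
The wait time used in this section can be computed from a player's ordered sequence of makes and misses. The sketch below assumes that wait time means the number of missed attempts between two consecutive made shots; the vector `made` is a made-up sequence, not data from the shots dataset.

```r
# Hypothetical shot sequence for one player in game order: 1 = made, 0 = missed.
made <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1)

# Positions of the made shots within the sequence
made_idx <- which(made == 1)

# Wait time = number of missed attempts between consecutive made shots
# (an assumed definition; 0 means the very next attempt also went in).
wait_time <- diff(made_idx) - 1

print(wait_time)
print(mean(wait_time))
```

For this toy sequence the wait times are 2, 0, 1, 3, 0, with a mean of 1.2. A histogram of `wait_time` with most of its mass at 0 would be the pattern a hot hand predicts.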

Analyzing Home Court Advantage

Data Ingestion, Explore Data, Perform preprocessing and Data Wrangling

After examining the dataset, we found irrelevant records and applied the following steps:

  1. Row filter - only keeping:
    • league = standard
    • country = US, USA, and Canada (Toronto Raptors)
    • gameStatus = Finish
  2. Some attributes need to be converted to the proper data type. The following were identified:
    • startTimeUTC and endTimeUTC need to be converted to datetime
      • redefining column gameduration as endTimeUTC - startTimeUTC in minutes
      • the current gameduration is unusable because it is a string in hours and minutes
    • calculating a new column called GameDate by converting startTimeUTC to a date
  3. Filter out columns that are irrelevant or contain duplicated information.
  4. The data for 2015 and 2016 is incomplete, so we focus on 2017 only.
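
The preprocessing steps listed above could look roughly like this in base R. It is a sketch only: the `games` rows are invented, and the timestamp format string `fmt` is an assumption about how the API encodes `startTimeUTC` and `endTimeUTC`.

```r
# Hypothetical rows shaped like the games data; real values are not reproduced.
games <- data.frame(
  league       = c("standard", "standard", "vegas", "standard"),
  country      = c("USA", "Canada", "USA", "USA"),
  gameStatus   = c("Finish", "Finish", "Finish", "Scheduled"),
  startTimeUTC = c("2017-10-18T00:00:00Z", "2017-10-19T23:30:00Z",
                   "2017-07-08T02:00:00Z", "2017-10-21T00:00:00Z"),
  endTimeUTC   = c("2017-10-18T02:29:00Z", "2017-10-20T01:48:00Z",
                   "2017-07-08T04:01:00Z", NA),
  stringsAsFactors = FALSE
)

# 1. Row filter: standard league, US or Canadian venue, finished games only
games <- subset(games, league == "standard" &
                       country %in% c("US", "USA", "Canada") &
                       gameStatus == "Finish")

# 2. Parse the timestamps (the format string is an assumption)
fmt <- "%Y-%m-%dT%H:%M:%S"
games$startTimeUTC <- as.POSIXct(games$startTimeUTC, format = fmt, tz = "UTC")
games$endTimeUTC   <- as.POSIXct(games$endTimeUTC, format = fmt, tz = "UTC")

# Game duration in minutes and the calendar date of tip-off (UTC)
games$gameduration <- as.numeric(difftime(games$endTimeUTC, games$startTimeUTC,
                                          units = "mins"))
games$GameDate <- as.Date(games$startTimeUTC, tz = "UTC")

print(games[, c("startTimeUTC", "gameduration", "GameDate")])
```

Two of the four toy rows survive the filter, with durations of 149 and 138 minutes.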

For each team - what’s the percentage of home games won?

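library(plotly)  # provides plot_ly(), add_trace(), layout() and the %>% pipe

# Stacked horizontal bar chart of home wins and home losses per team
fig <- plot_ly(merge_df, x = ~Wins_Home, y = ~Team, type = 'bar', orientation = 'h', name = 'HomeGameWin',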
        marker = list(color = 'rgba(246, 78, 139, 0.6)',
                      line = list(color = 'rgba(246, 78, 139, 1.0)',
                                  width = 1)))
fig <- fig %>% add_trace(x = ~Losses_Home, name = 'HomeGameLoss',text = paste0(round(merge_df$HomeWinningPercentage,2)*100,"%"), textposition = 'outside',
            marker = list(color = 'rgba(58, 71, 80, 0.6)',
                          line = list(color = 'rgba(58, 71, 80, 1.0)',
                                      width = 1)))
fig <- fig %>% layout(barmode = 'stack',
        title = "Number of Wins and Losses at Home Games for the 2017 - 2018 NBA Season",
         xaxis = list(title = "Number of Home Games"),
         yaxis = list(title ="NBA Teams")
        )
fig
cat(paste("On average, NBA teams have a ", round(mean(merge_df$HomeWinningPercentage) * 100,0),"%", " chance of winning a home game, giving them a slight advantage over their opponents.", sep =""))
## On average, NBA teams have a 58% chance of winning a home game, giving them a slight advantage over their opponents.

This is only one perspective. There may be many confounding variables, e.g. good teams win both at home and away, which can mask or exaggerate any home court advantage. So we cannot make a conclusive statement.
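
One way to reduce the team-strength confound is to compare each team with itself, looking at home minus away winning percentage. The sketch below is only an illustration of that idea: the column names follow `merge_df`, while `toy_df`, `HomeEdge`, and all values are made up.

```r
# Hypothetical team-level table; the column names follow merge_df,
# the values are made up for illustration.
toy_df <- data.frame(
  Team = c("A", "B", "C", "D", "E", "F"),
  HomeWinningPercentage = c(0.80, 0.61, 0.55, 0.49, 0.37, 0.66),
  AwayWinningPercentage = c(0.68, 0.44, 0.41, 0.39, 0.24, 0.46)
)

# Within-team difference: each team acts as its own control,
# so overall team strength cancels out.
toy_df$HomeEdge <- toy_df$HomeWinningPercentage - toy_df$AwayWinningPercentage

print(toy_df[, c("Team", "HomeEdge")])
cat("Mean home edge:", round(mean(toy_df$HomeEdge), 3), "\n")

# Paired t-test of home vs away winning percentage across teams
print(t.test(toy_df$HomeWinningPercentage, toy_df$AwayWinningPercentage,
             paired = TRUE))
```

A paired comparison like this asks whether teams do better at home than they themselves do away, rather than whether home teams win more often overall.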

Does the location affect outcomes?

location_df = nba
row.names(location_df) <- 1:nrow(location_df)
# plotting arena locations using the us.cities data from the maps package
library(maps)
data(us.cities)

remove_cities = c('Cleveland OH', 'North Atlanta GA', 'West New York NY', 'North Miami Beach FL', 'North Miami FL', 'Portland ME', 'Port Charlotte FL', 'North Las Vegas NV', 'Kansas City KS', 'Seattle Hill-Silver Firs WA', 'South San Francisco CA', 'West Sacramento CA', 'East Los Angeles CA', 'Miami Beach FL')

us.cities = subset(us.cities, !us.cities$name %in% remove_cities)

location_df = nba[, c('seasonYear','hTeam.nickName','vTeam.nickName', 'arena', 'city')]

idx2 <- sapply(location_df$city, grep, us.cities$name)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))

location_df = cbind(location_df[unlist(idx1),,drop=F], us.cities[unlist(idx2),,drop=F])

#graphical analysis
graph_loc = unique(location_df[c("city", "name","arena", "lat", "long")])

arena_wins = nba %>% group_by(arena, teamwon, hTeam.nickName
) %>% filter(teamwon == hTeam.nickName) %>% summarise(Wins_in_Arena= n())

geo_wins = merge(x = graph_loc, y = arena_wins, by= "arena", all.x = TRUE)

#GRAPHING WINS BY AREA ON MAP
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  subunitwidth = 1,
  countrywidth = 1,
  subunitcolor = toRGB("white"),
  countrycolor = toRGB("white")
)

fig <- plot_geo(geo_wins, lat = ~lat, lon = ~long)

fig <- fig %>% add_markers(
    text = ~paste(arena, city,sep = "<br />"),
    size = ~Wins_in_Arena, hoverinfo = "text"
  )
fig <- fig %>% colorbar(title = "Games")
fig <- fig %>% layout(
    title = 'US Map of NBA locations', geo = g
  )
fig

To limit the other factors that may affect team performance in different locations, Denver is the primary focus. The objective is to compare each team’s chance of winning at Denver with its chance of winning away games in general.

# what's this team's winning percentage at denver? 
denver = subset(nba, arena %in% "Pepsi Center")
teamwon = denver %>% group_by(teamwon) %>% summarise(count = n())
teamvisitng = denver %>% group_by(vTeam.nickName) %>% summarise(count = n())
colnames(teamvisitng) = c("team", "num_games")
colnames(teamwon) = c("team", "num_games_won")

winningatdenver = merge(x = teamvisitng, y = teamwon, by="team" , all.x = TRUE) 

winningatdenver[is.na(winningatdenver)] = 0

winningatdenver$winning_percentage = winningatdenver$num_games_won/winningatdenver$num_games

cat(str_c("League's likelihood of winning at Denver is ", round(mean(winningatdenver$winning_percentage),4)*100, "%", "\n", "League's likelihood of winning at Away Game is ", round(mean(merge_df$AwayWinningPercentage),4)*100,"%", "\n",
"SD is ", round(sd(merge_df$AwayWinningPercentage),4)*100,"%"))
## League's likelihood of winning at Denver is 30.46%
## League's likelihood of winning at Away Game is 40.9%
## SD is 14.5%

Even though teams are less likely to win at Denver, 30.46% is within 1 standard deviation of 40.9% (+/- 14.5%). This means we cannot draw any conclusion with high confidence.
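
The comparison above amounts to a z-score. The short check below uses only the three figures printed in the output; the variable names are introduced for the illustration.

```r
denver_pct <- 30.46  # league winning percentage at Denver (from the output above)
away_mean  <- 40.9   # mean away winning percentage
away_sd    <- 14.5   # standard deviation of away winning percentage

# Distance of the Denver figure from the away mean, in standard deviations
z <- (denver_pct - away_mean) / away_sd
cat("z =", round(z, 2), "\n")

# TRUE when the Denver figure lies within one standard deviation of the mean
print(abs(z) < 1)
```

The result is about -0.72, so Denver sits below the average away winning percentage but well within the normal team-to-team spread.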

fig = plot_ly(type = 'box')
fig = fig %>% add_boxplot(y = merge_df$HomeWinningPercentage, quartilemethod="linear", name="% of Winning at Home Games",
                          jitter = 0.3, pointpos = -1.8, boxpoints = 'all')
fig = fig %>% add_boxplot(y = merge_df$AwayWinningPercentage, quartilemethod="linear", name="% of Winning at Away Games",
                           jitter = 0.3, pointpos = -1.8, boxpoints = 'all')
fig =  fig %>% layout(title = list(text = "Distribution of Winning Percentage for Teams"),
                      yaxis = list(range=c(0,1)
                                )
                      )
fig

This supports the idea that winning percentage varies widely from team to team.

Examining just the Warriors to see if there is a drop in performance given travel and time zone differences.

#warriors performance at home games vs away games:
warriors = nba[ (nba$hTeam.nickName == "Warriors" ) | (nba$vTeam.nickName == "Warriors"), ]

#warriors performance at home games:
warriorshome <- warriors[which(warriors$hTeam.nickName == "Warriors"),]
homedensity <- density(warriorshome$hTeam.score.points)

#warriors performance at away games:
warriorsaway = warriors[which(warriors$vTeam.nickName == "Warriors"), ]
awaydensity = density(warriorsaway$vTeam.score.points)

# Helper that builds a vertical dashed line shape at position x
vline <- function(x = 0, color = "grey") {
  list(
    type = "line", 
    y0 = 0, 
    y1 = 1, 
    yref = "paper",
    x0 = x, 
    x1 = x, 
    opacity = 0.6,
    line = list(color = color, dash = "dash")
  )
}

fig <- plot_ly(x = ~homedensity$x, y = ~homedensity$y, type = 'scatter', mode = 'lines', name = 'Points Made at Home Games', fill = 'tozeroy') %>% layout(shapes = list(vline(median(homedensity$x)), vline(median(awaydensity$x))))

fig <- fig %>% add_trace(x = ~awaydensity$x, y = ~awaydensity$y, name = 'Points Made at Away Games', fill = 'tozeroy')
fig <- fig %>% layout(xaxis = list(title = 'Points Scored per Game'),
         yaxis = list(title = 'Density'), 
         title = list(text = "Warriors: Points Distribution at Home vs Away Games between 2015 - 2020", y = 0.95))
fig
cat(str_c("The median of the GSW Total Score at Away Games is " , median(awaydensity$x), ". \n", "The median of the GSW Total Score at Home Games is ", median(homedensity$x), ". \n"))
## The median of the GSW Total Score at Away Games is 108. 
## The median of the GSW Total Score at Home Games is 112.

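The Warriors’ scoring doesn’t seem to be significantly affected by playing away: there is only a 4-point difference between home and away games. Note that median(homedensity$x) and median(awaydensity$x) are taken over the grid of points where the density is evaluated, so these figures describe the centre of each plotted range rather than the median of the actual game scores.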

Data Ingestion and Preprocessing for Stephen Curry’s Game stats

This dataset will be used to compare Steph Curry’s performance during regular season games vs playoff games. To keep the comparison fair, the data is filtered to the seasons in which the Warriors made the playoffs.

Performance During Regular vs Playoff Games

  • Is there a performance difference between Regular Season Games and Playoff Games for Top Players?
  • Focusing only on Steph Curry for this analysis.
# What is Curry's shooting percentage during regular season - including fg and ft
regular = seasons_in_playoff[which(seasons_in_playoff$Type == "REGULAR SEASON STATS") ,]
shooting_percentage = regular$`Successful Shots`/regular$`Total Shots`
dates = regular$Dates
regdf = data.frame(dates, shooting_percentage)

# What is Curry's shooting percentage during conference and finals
postseason = subset(seasons_in_playoff, !(seasons_in_playoff$Type %in% "REGULAR SEASON STATS"))
shooting_percentage = postseason$`Successful Shots`/postseason$`Total Shots`
dates = postseason$Dates
psdf = data.frame(dates, shooting_percentage)

##

#summary(regdf$shooting_percentage)
#summary(psdf$shooting_percentage)
cat(paste("Regular season shooting percentage SD:", sd(regdf$shooting_percentage), ". \n",
      "Postseason shooting percentage SD: ", sd(psdf$shooting_percentage), ". \n", 
"Since the SDs are similar, a t-test is applied. "))
## Regular season shooting percentage SD: 0.117206780167678 . 
##  Postseason shooting percentage SD:  0.105224210827332 . 
##  Since the SDs are similar, a t-test is applied.
# T- test psdf and regdf
t.test(regdf$shooting_percentage, psdf$shooting_percentage)
## 
##  Welch Two Sample t-test
## 
## data:  regdf$shooting_percentage and psdf$shooting_percentage
## t = 1.7141, df = 136.98, p-value = 0.08877
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.00326292  0.04574416
## sample estimates:
## mean of x mean of y 
## 0.4739225 0.4526819
cat(paste("H0: µ1 = µ2 (psdf and regdf means are equal)", "\n","HA: µ1 ≠ µ2 (psdf and regdf  means are not equal)"))
## H0: µ1 = µ2 (psdf and regdf means are equal) 
##  HA: µ1 ≠ µ2 (psdf and regdf  means are not equal)

Conclusion: The Welch Two Sample t-test gives a p-value of 0.089, which is above the 0.05 significance level, so we fail to reject H0. There is no statistically significant difference in Steph Curry’s shooting percentage between regular season games and playoff games.

fig <- plot_ly(x=~regdf$dates, y=~regdf$shooting_percentage, type = 'scatter', name = "regular season games")
fig <- fig %>% add_trace(x=~psdf$dates, y =~psdf$shooting_percentage, name = "playoff games") %>%layout(title = "Shooting Percentage by Date", yaxis = list(title ="Shooting Percentage"), xaxis = list(title ="Game Dates"))
fig

The spread of Steph Curry’s shooting percentage is similar between regular season games and playoff games, with slightly more variance in the regular season (SD of 0.117 vs 0.105).

Examining Correlation between Steph Curry PTS Contributions and Team Score:

regular$`Score GS` = as.numeric(regular$`Score GS`)
# Encode the game result as 1 for a win and 0 for a loss
x = as.integer(regular$Result == "W")

stats = regular[,c("Score GS",  "3 Points Succesful","PTS", "REB", "AST", "BLK", "STL","TO",  "Minutes")]
stats$Result_Encoded = x
regcorr = cor(stats)

#postseason
postseason$`Score GS` = as.numeric(postseason$`Score GS`)
# Encode the game result as 1 for a win and 0 for a loss
x = as.integer(postseason$Result == "W")

psstats = postseason[,c("Score GS",  "3 Points Succesful","PTS", "REB", "AST", "BLK", "STL","TO",  "Minutes")]
psstats$Result_Encoded = x
pscorr =cor(psstats)

fig1 <- plot_ly(x=colnames(regcorr), y=rownames(regcorr), z = regcorr, type = "heatmap", colors = c("cyan", "blue")) %>%
    layout(margin = list(l=120))


fig2 <- plot_ly(x=colnames(pscorr), y=rownames(pscorr), z = pscorr, type = "heatmap", colors = c("cyan", "blue")) %>%
    layout(margin = list(l=120))


fig <- subplot(fig1, fig2, nrows = 2, margin = 0.07) %>% layout(title = "Correlation HeatMap Reg vs Post")
fig

Result_Encoded = 1 if Win, 0 if Loss. The correlation heatmap did not reveal any strong correlation between winning and any single factor. The main takeaway is that wins are positively correlated with the Warriors’ team score.

Small correlations to point out:

  1. Wins are positively correlated with Steph Curry’s points and rebounds.
  2. Wins are negatively correlated with Steph Curry’s turnovers and playing time.
    • A further analysis on this topic would check whether Steph is more likely to make mistakes as he plays longer due to exhaustion and loss of focus.

Impact of Not Resting Top Players

xreg = seq(15, 60, by = 5)
p_win_reg = c()
  
for(i in xreg){
  testdf = subset(stats, (stats$Minutes >= i) & (stats$Minutes < i+5))
  probwins = sum(testdf$Result_Encoded)/nrow(testdf)
  p_win_reg = c(p_win_reg, probwins)
}


xps = seq(15, 60, by = 5)
p_win = c()
  
for(i in xps){
  testdf = subset(psstats, (psstats$Minutes >= i) & (psstats$Minutes < i+5))
  probwins = sum(testdf$Result_Encoded)/nrow(testdf)
  p_win = c(p_win, probwins)
}

fig = plot_ly(x = xps, y = p_win, type = "scatter", name = "playoffs")
fig = fig %>% add_trace(x=xreg, y= p_win_reg, name = "regular seasons") %>% layout(title = "Chance of Win vs Minutes Played", yaxis = list(title = "Probability of Winning"), xaxis = list(title = "Minutes Played by Curry"))
fig

There appears to be a negative, roughly linear relationship between Steph Curry’s total time played and the Warriors’ chance of winning. This may be explained by the depth of the Warriors’ roster: when the team has no depth and players are injured, Steph Curry has to play more minutes, which decreases the team’s chance of winning.
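
The binning done by the loops above can also be written with `cut` and `tapply`, which makes it easy to report how many games fall in each bin. This is a sketch on made-up data: `toy_stats` and its values are hypothetical, and only the column names `Minutes` and `Result_Encoded` come from the analysis.

```r
# Hypothetical game log: minutes played and result (1 = win, 0 = loss).
toy_stats <- data.frame(
  Minutes        = c(28, 31, 33, 34, 36, 37, 38, 41, 42, 44),
  Result_Encoded = c(1,  1,  1,  0,  1,  1,  0,  0,  1,  0)
)

# Same 5-minute bins as the loops above: [15, 20), [20, 25), ..., [60, 65)
breaks <- seq(15, 65, by = 5)
toy_stats$bin <- cut(toy_stats$Minutes, breaks = breaks, right = FALSE)

# Win rate and number of games per bin; empty bins show up as NA
# with a game count of 0.
win_rate <- tapply(toy_stats$Result_Encoded, toy_stats$bin, mean)
n_games  <- table(toy_stats$bin)

print(data.frame(bin = names(win_rate),
                 games = as.integer(n_games),
                 win_rate = as.numeric(win_rate)))
```

Showing the game count next to each win rate helps flag bins that rest on only a handful of games, where the estimated probability is unreliable.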

Project Conclusion:

Many factors affect a team’s performance. In examining the questions above, we attempted to verify the legitimacy of several NBA Myths.

  1. Hot Hand doesn’t appear to apply to Steph Curry or Kevin Durant. They appear to be consistent players who make a successful shot every 1-2 attempts on average. For Jokic, there was not enough data to conclusively state that he is a streaky player. He could just be taking easy shots and making smarter decisions. The x and y coordinates of Jimmy Butler’s shots appear to be highly correlated. A plausible explanation is that he’s a rim put-back player or a dunker, not a shooter, so he’s usually very close to the basket.

  2. The Home Team does appear to have a slight advantage. Furthermore, teams appear to have a challenging time winning at Denver in the 2017 season. However, since winning percentages vary widely across teams, we cannot conclude that home court advantage is real. Many confounding variables are at play, e.g. some teams are more dominant given their makeup of players.

  3. Steph Curry is known as an elite player. In the 5 years that he made the playoffs, there is no significant difference between his performance during the regular season and the postseason. (Granted, he won MVP in two of those seasons, proving that he was playing well during the regular season.) There appears to be a negative correlation between the Warriors’ chance of winning and Steph Curry’s playing time. Further analysis needs to be conducted before making any conclusive statements. Could this be attributed to player exhaustion and loss of focus during long games? Or maybe the roster lacks depth and other star players are injured or out?